Recall: In ridge and LASSO regression, we add a penalty term that is balanced by a parameter \(\lambda\)
\[\text{minimize:} \quad \text{RSS} + \lambda \times \text{penalty}\]
What is the “best” choice of \(\lambda\)?
Last class, we tried two LASSO models:
We found that a larger penalty parameter led to more coefficients of 0 (i.e., more excluded variables):
# A tibble: 63 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 3.81 0.1
2 Creative 0.0252 0.1
3 Energetic 0.000327 0.1
4 Tingly 0 0.1
5 Euphoric 0.115 0.1
6 Relaxed 0.188 0.1
7 Aroused 0 0.1
8 Happy 0.268 0.1
9 Uplifted 0.108 0.1
10 Hungry 0 0.1
# ℹ 53 more rows
# A tibble: 63 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 4.32 0.5
2 Creative 0 0.5
3 Energetic 0 0.5
4 Tingly 0 0.5
5 Euphoric 0 0.5
6 Relaxed 0 0.5
7 Aroused 0 0.5
8 Happy 0 0.5
9 Uplifted 0 0.5
10 Hungry 0 0.5
# ℹ 53 more rows
\(\lambda = 0.1 \rightarrow\) 7 variables kept, 56 dropped
\(\lambda = 0.5 \rightarrow\) 0 variables kept, 63 dropped
Prediction: What penalty leads to the best cross-validated prediction accuracy?
Let’s use tune() and tune_grid() to find out!
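A minimal sketch of that tuning workflow, assuming a data frame `cannabis` with a numeric outcome `Rating` and the descriptor-word predictors from the activity (those names are placeholders):

```r
library(tidymodels)

# Mark the penalty as "to be tuned"; mixture = 1 makes this a LASSO
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_wf <- workflow() |>
  add_model(lasso_spec) |>
  add_formula(Rating ~ .)

# Cross-validate over an automatic grid of 10 candidate penalties
cannabis_cv <- vfold_cv(cannabis, v = 5)

lasso_results <- lasso_wf |>
  tune_grid(resamples = cannabis_cv, grid = 10)

collect_metrics(lasso_results)              # RMSE and R-squared per penalty
select_best(lasso_results, metric = "rmse") # best cross-validated penalty
```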
One wrinkle in tuning: The automatic grid chooses values on a log scale, not evenly spaced:
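You can see the log-scale spacing by building a grid by hand; `penalty()` is tuned on the log10 scale by default, so a regular grid gives powers of 10:

```r
library(tidymodels)  # dials provides penalty() and grid_regular()

# Evenly spaced on the log10 scale from 10^-3 to 10^0
grid_regular(penalty(range = c(-3, 0)), levels = 4)
# penalty values: 0.001, 0.01, 0.1, 1 -- not evenly spaced on the raw scale
```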
Don’t forget about interpretability!
How many variables do we want to retain?
Because regularized methods de-prioritize RMSE, they will rarely give us smaller prediction residuals than ordinary least squares.
So, why do it?
LASSO \(\rightarrow\) Variable selection
If we can achieve nearly the same predictive power with many fewer variables, we have a more interpretable model.
Open Activity-Variable-Selection from last class
Tweak your choice of penalty in your LASSO regression until you get approximately the same number of variables as you did via stepwise selection.
Are they the same variables?
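One way to check how many (and which) variables a given penalty retains, assuming the same placeholder data frame `cannabis` with outcome `Rating`:

```r
library(tidymodels)

# Fit a LASSO at one fixed penalty; tweak this value and re-run
lasso_spec <- linear_reg(penalty = 0.05, mixture = 1) |>
  set_engine("glmnet")

lasso_fit <- workflow() |>
  add_model(lasso_spec) |>
  add_formula(Rating ~ .) |>
  fit(data = cannabis)

# The variables with nonzero coefficients are the ones "kept"
lasso_fit |>
  tidy() |>
  filter(term != "(Intercept)", estimate != 0) |>
  pull(term)
```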
Model Stability
Consider dividing the dataset into 3 random subsets.
We fit a linear model on all predictors to each subset.
Should we expect similar coefficient estimates?
When we have many variables, there is probably some collinearity.
That is, different combinations of variables carry nearly the same information.
This makes the model very unstable!
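A small simulation illustrating the instability, with made-up variables `x1` and `x2` that carry nearly the same information:

```r
set.seed(123)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly collinear with x1
y  <- 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Fit the same linear model on 3 random subsets
subset_id <- sample(rep(1:3, length.out = n))
for (k in 1:3) {
  print(coef(lm(y ~ x1 + x2, data = dat[subset_id == k, ])))
}
# The x1 and x2 coefficients swing from subset to subset,
# even though their *sum* stays near 2: the model can't decide
# how to split the shared information between them.
```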
So, what variables should we use?
Why does the ridge penalty help?
It reduces the variance of the coefficient estimates.
It lets all the variables “share the load” instead of putting too much weight on any one coefficient.
Lowering the variance of the estimates increases the bias.
In other words, we are no longer prioritizing prediction or RMSE: our \(\hat{y}\)'s are not as close to our \(y\)'s.
Elastic Net
What if we want to reduce the number of variables AND reduce the coefficient variance?
We’ll just use two penalty terms:
\[ \text{RSS} + \frac{\lambda}{2} \times \text{(LASSO penalty)} + \frac{\lambda}{2} \times \text{(Ridge penalty)}\]
When we do half-and-half, this is called “Elastic Net”.
Why half-and-half? Why not 1/3 and 2/3? 1/4 and 3/4???
\[\text{RSS} + \alpha \, (\lambda \times \text{LASSO penalty}) + (1 - \alpha) \, (\lambda \times \text{Ridge penalty})\]
\(\alpha\) is the mixture parameter.
Open Activity-Variable-Selection from last class.
Tune both the mixture and the penalty parameters.
Plot the RMSE and / or R-squared across a few penalties (at one mixture)
Plot the RMSE and / or R-squared across a few mixtures (at one penalty)
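A sketch of tuning both parameters at once, again assuming the placeholder `cannabis` data with outcome `Rating`:

```r
library(tidymodels)

# Both the penalty and the mixture are marked for tuning
enet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

enet_wf <- workflow() |>
  add_model(enet_spec) |>
  add_formula(Rating ~ .)

# A regular grid: penalties on the log10 scale, mixtures from 0 to 1
enet_grid <- grid_regular(penalty(range = c(-3, 0)),
                          mixture(),
                          levels = 5)

enet_results <- enet_wf |>
  tune_grid(resamples = vfold_cv(cannabis, v = 5),
            grid = enet_grid)

# One panel per metric, one line per mixture, penalty on the x-axis
autoplot(enet_results)
```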
Open Activity-Mixture-Models.qmd
Recall: We wanted to predict the Type of cannabis from the descriptor words.
Consider only Indica vs. Sativa (remove Hybrid)
Can you combine logistic regression with LASSO to tell me which words best separate Indica and Sativa?
How does this compare to what you find with a decision tree?
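One possible sketch for the classification version, assuming the cannabis data has a `Type` column with values Indica / Sativa / Hybrid (column names are assumptions based on the activity description):

```r
library(tidymodels)

# Keep only Indica vs. Sativa
indica_sativa <- cannabis |>
  filter(Type != "Hybrid") |>
  mutate(Type = factor(Type))

# Logistic regression with a LASSO penalty (mixture = 1)
logit_lasso <- logistic_reg(penalty = 0.05, mixture = 1) |>
  set_engine("glmnet")

logit_fit <- workflow() |>
  add_model(logit_lasso) |>
  add_formula(Type ~ .) |>
  fit(data = indica_sativa)

# The words with nonzero coefficients are the ones that best
# separate the two types; larger |estimate| = stronger separator
logit_fit |>
  tidy() |>
  filter(term != "(Intercept)", estimate != 0) |>
  arrange(desc(abs(estimate)))
```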